In this notebook, we perform an in-depth exploratory and descriptive analysis of the Crimes Against Women dataset, which contains reported cases across Indian states and union territories from 2001 to 2021. This dataset provides critical insights into the frequency, types, and distribution of crimes committed against women, making it a valuable resource for policy formulation, social awareness, and data-driven intervention planning.
The purpose of this analysis is to uncover patterns, identify high-risk regions, examine the most prevalent types of crimes, and understand long-term trends in gender-based violence. We place particular emphasis on visualizing changes over time, comparing crime volumes across different states, and assessing the impact of specific types of violence such as domestic abuse, assault on modesty, rape, and trafficking.
By analyzing the cleaned and reshaped dataset, we aim to build a strong foundation for more advanced statistical or predictive modeling and inform future policy recommendations or public safety initiatives.
We begin our analysis by importing essential Python libraries used for data manipulation, visualization, and file handling:
- `pandas`: Used for loading, cleaning, and reshaping structured tabular data, allowing efficient data analysis workflows.
- `numpy`: Supports numerical operations and handling of missing or inconsistent data values.
- `os`: Helps manage file paths and directories during the analysis and export process.
- `plotly.express`: A powerful graphing library used to create interactive and insightful visualizations that reveal key patterns and trends across states, years, and crime categories.

# import libraries
import os
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

To ensure reproducibility and organized storage, we programmatically create the directories used by this notebook, such as the results directory referenced later as `results_dir`, if they don't already exist. These directories store intermediate and final outputs.
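
As a minimal sketch of this step, the directories can be created with `os.makedirs`. The folder names below (`data/processed` and `results`) and the temporary base directory are assumptions for illustration, not the notebook's actual paths.

```python
import os
import tempfile

# Hypothetical layout: these folder names are assumptions, not the notebook's real paths.
base_dir = tempfile.mkdtemp()  # stand-in for the project root
processed_dir = os.path.join(base_dir, "data", "processed")
results_dir = os.path.join(base_dir, "results")

for directory in (processed_dir, results_dir):
    # exist_ok=True makes the call a no-op when the directory already exists
    os.makedirs(directory, exist_ok=True)

print(os.path.isdir(processed_dir), os.path.isdir(results_dir))
```

Because `exist_ok=True` is set, re-running the cell does not raise an error when the directories are already present.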
We load the cleaned version of the Crimes Against Women dataset from the processed data directory into a pandas DataFrame. Calling `head(10)` displays the first ten records, giving a quick view of key columns such as state, crime type, and year. This helps verify that the dataset has been cleaned and reshaped into a long format suitable for analysis and visualization.
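
The loading step might look like the sketch below. The column names follow the tables and plotting code in this notebook, but the in-memory CSV and its values are made-up placeholders; in practice `pd.read_csv` would be pointed at the cleaned file in the processed data directory.

```python
import io
import pandas as pd

# Placeholder CSV in long format; the values are illustrative, not real data.
csv_text = """State,Crime Type,Year,Value
goa,Dowry Deaths,2001,1
goa,Domestic violence,2001,20
kerala,Dowry Deaths,2001,3
"""

# In the notebook this would be a path to the cleaned CSV file instead of StringIO.
crime_df = pd.read_csv(io.StringIO(csv_text))

# Show up to the first ten rows to confirm the long format (one row per state, crime type, year).
print(crime_df.head(10))
```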
Before performing analysis, we check the overall structure of the dataset:
- `df.shape` for the number of rows and columns.
- `df.dtypes` for the column data types, converting where needed (for example, `year` from object to int).

Summary statistics for the numeric columns, Year and Value:

| Statistic | Year | Value (Number of Cases) |
|---|---|---|
| Count | 5,152 | 5,152 |
| Mean | 2011.15 | 944.82 |
| Std Dev | 6.05 | 2,174.42 |
| Min | 2001 | 0 |
| 25% | 2006 | 4 |
| Median | 2011 | 79.5 |
| 75% | 2016 | 814.25 |
| Max | 2021 | 23,278 |
These summary statistics help us understand the scale and variability of crimes against women across states and years, setting a foundation for deeper trend and regional analysis.
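
The structure checks and summary tables can be sketched as follows. The toy DataFrame is an assumption standing in for the real dataset; only the method calls (`shape`, `dtypes`, `astype`, `describe`) reflect the steps described above.

```python
import pandas as pd

# Toy stand-in for the cleaned dataset (illustrative values only).
toy = pd.DataFrame({
    "State": ["goa", "goa", "kerala", "kerala"],
    "Crime Type": ["Dowry Deaths", "Domestic violence"] * 2,
    "Year": ["2001", "2001", "2002", "2002"],  # stored as strings on purpose
    "Value": [1, 20, 3, 40],
})

print(toy.shape)   # (rows, columns)
print(toy.dtypes)  # column data types before conversion

# Convert Year from a string column to integers so it is treated as numeric.
toy["Year"] = toy["Year"].astype(int)

# describe() on numeric columns gives count, mean, std, min, quartiles, max.
numeric_summary = toy[["Year", "Value"]].describe()

# describe() on text columns gives count, unique, top, freq.
categorical_summary = toy[["State", "Crime Type"]].describe()

print(numeric_summary)
print(categorical_summary)
```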
The dataset contains two categorical columns: State and Crime Type.
This summary helps us understand the diversity and distribution of states and crime categories recorded in the dataset.
| Statistic | State | Crime Type |
|---|---|---|
| Count | 5152 | 5152 |
| Unique | 37 | 7 |
| Top | andhra pradesh | No. of Rape cases |
| Freq | 147 | 736 |
The following table shows how many rows are recorded for each category in the Crime Type column. Every crime type appears in 736 rows, so the dataset is balanced across categories. Note that these are row counts, not case totals, so they do not indicate which crimes are reported most often.

| Crime Type | Count |
|---|---|
| No. of Rape cases | 736 |
| Kidnap And Assault | 736 |
| Dowry Deaths | 736 |
| Assault against women | 736 |
| Assault against modesty of women | 736 |
| Domestic violence | 736 |
| Women Trafficking | 736 |
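
Expressed as proportions of all rows, each crime type accounts for one seventh (0.142857) of the dataset:

| Crime Type | Proportion |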
|---|---|
| No. of Rape cases | 0.142857 |
| Kidnap And Assault | 0.142857 |
| Dowry Deaths | 0.142857 |
| Assault against women | 0.142857 |
| Assault against modesty of women | 0.142857 |
| Domestic violence | 0.142857 |
| Women Trafficking | 0.142857 |
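
The same calculation by state shows that most states and union territories contribute an equal share of rows (0.028533 each). Telangana and Delhi UT have about half as many rows, and Dadra and Nagar Haveli appears under two spellings (`d&n haveli` and `d & n haveli`) whose shares add up to one full share:

| State | Proportion |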
|---|---|
| andhra pradesh | 0.028533 |
| uttar pradesh | 0.028533 |
| odisha | 0.028533 |
| punjab | 0.028533 |
| rajasthan | 0.028533 |
| sikkim | 0.028533 |
| tamil nadu | 0.028533 |
| tripura | 0.028533 |
| uttarakhand | 0.028533 |
| mizoram | 0.028533 |
| west bengal | 0.028533 |
| a & n islands | 0.028533 |
| chandigarh | 0.028533 |
| daman & diu | 0.028533 |
| lakshadweep | 0.028533 |
| puducherry | 0.028533 |
| arunachal pradesh | 0.028533 |
| nagaland | 0.028533 |
| meghalaya | 0.028533 |
| himachal pradesh | 0.028533 |
| assam | 0.028533 |
| bihar | 0.028533 |
| chhattisgarh | 0.028533 |
| goa | 0.028533 |
| gujarat | 0.028533 |
| manipur | 0.028533 |
| haryana | 0.028533 |
| jammu & kashmir | 0.028533 |
| jharkhand | 0.028533 |
| karnataka | 0.028533 |
| kerala | 0.028533 |
| madhya pradesh | 0.028533 |
| maharashtra | 0.028533 |
| telangana | 0.014946 |
| d&n haveli | 0.014946 |
| delhi ut | 0.014946 |
| d & n haveli | 0.013587 |
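
The count and proportion tables above correspond to `value_counts()` with and without `normalize=True`. The sketch below uses a toy Series and also shows one possible way to merge the two spellings of Dadra and Nagar Haveli; treating them as the same territory is an assumption based on the labels and on their shares summing to one full share.

```python
import pandas as pd

# Toy State column containing the two spellings seen in the table above.
states = pd.Series(
    ["goa", "goa", "d&n haveli", "d & n haveli", "kerala", "kerala"],
    name="State",
)

# Share of rows per label before any cleaning.
raw_share = states.value_counts(normalize=True)

# Normalize spacing around '&' so both spellings map to a single label.
cleaned = states.str.replace(r"\s*&\s*", " & ", regex=True)
clean_share = cleaned.value_counts(normalize=True)

print(raw_share)
print(clean_share)
```

After cleaning, the merged label carries the same share as the other states in the toy data.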
Presenting the data visually like this aids in identifying key areas to focus on when addressing these crimes.

# Donut chart of case totals by crime type for one year.
# `crime_2020`, `year`, and `results_dir` are assumed to be defined in earlier cells.
fig = px.pie(
    crime_2020,
    names='Crime Type',
    values='Value',
    title=f'Crime Distribution by Type in {year}',
    hole=0.4,
    template='presentation',
    color_discrete_sequence=px.colors.sequential.Greens_r
)
fig.update_traces(
    textinfo='percent',
    textposition='inside',
    hoverinfo='label+percent'
)
fig.update_layout(
    paper_bgcolor="rgba(9,0,8,0)",
    plot_bgcolor="rgba(0,0,0,0)"
)
# Static image export (JPG/PNG) requires the kaleido package.
fig.write_image(os.path.join(results_dir, 'pie_chart.jpg'))
fig.write_image(os.path.join(results_dir, 'pie_chart.png'))
fig.write_html(os.path.join(results_dir, 'pie_chart.html'))
fig.show()

This pie chart shows the proportion of each type of crime against women recorded for the selected year.
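
The input to the pie chart is not shown in this notebook. One plausible way to build a per-year table like `crime_2020` is sketched below; the toy data and the exact aggregation are assumptions, not the author's confirmed method.

```python
import pandas as pd

# Toy long-format data (illustrative values only).
crime_df = pd.DataFrame({
    "State": ["goa", "kerala", "goa", "kerala", "goa"],
    "Crime Type": ["Dowry Deaths", "Dowry Deaths",
                   "Domestic violence", "Domestic violence", "Dowry Deaths"],
    "Year": [2020, 2020, 2020, 2020, 2019],
    "Value": [2, 3, 10, 5, 7],
})

year = 2020

# Keep only the selected year, then total the cases per crime type.
crime_2020 = (
    crime_df[crime_df["Year"] == year]
    .groupby("Crime Type", as_index=False)["Value"]
    .sum()
)

# The pie chart computes these shares itself; they are shown here for reference.
crime_2020["Share"] = crime_2020["Value"] / crime_2020["Value"].sum()
print(crime_2020)
```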

# Stacked bar chart of case totals by crime type for the five states with the most cases.
# `crime_by_state` and `results_dir` are assumed to be defined in earlier cells.
fig = px.bar(
    crime_by_state,
    x='State',
    y='Value',
    color='Crime Type',
    title='Crime Distribution in Top 5 States',
    barmode='stack',
    text='Value',
    height=1000,
    width=1200,
    template='presentation',
    color_discrete_sequence=px.colors.sequential.Greens_r
)
fig.update_traces(textposition='outside')
fig.update_layout(
    xaxis_tickangle=23,
    margin=dict(l=100, r=50, t=50, b=50)
)
fig.write_image(os.path.join(results_dir, 'Top5_most_reported_bar_plot.jpg'))
fig.write_image(os.path.join(results_dir, 'Top5_most_reported_bar_plot.png'))
fig.write_html(os.path.join(results_dir, 'Top5_most_reported_bar_plot.html'))
fig.show()

This bar chart illustrates the total number of reported crimes against women in the top 5 states, broken down by crime type.
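
A table like `crime_by_state` could be derived by ranking states on total cases and keeping the five largest. This is a sketch under that assumption, using made-up state names and values.

```python
import pandas as pd

# Toy data: six placeholder states, two placeholder crime types each.
rows = []
for i, state in enumerate(["s1", "s2", "s3", "s4", "s5", "s6"], start=1):
    rows.append({"State": state, "Crime Type": "Type A", "Value": 10 * i})
    rows.append({"State": state, "Crime Type": "Type B", "Value": i})
crime_df = pd.DataFrame(rows)

# Total cases per state, then the labels of the five largest totals.
totals = crime_df.groupby("State")["Value"].sum()
top_states = totals.nlargest(5).index

# Keep only those states and total the cases per state and crime type.
crime_by_state = (
    crime_df[crime_df["State"].isin(top_states)]
    .groupby(["State", "Crime Type"], as_index=False)["Value"]
    .sum()
)
print(crime_by_state)
```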
The line chart titled “Crime Trends Over Years” illustrates how different types of crimes against women have changed over time.
The x-axis shows the year and the y-axis shows the number of reported cases (`Value`), with one line for each crime type.

# Line chart of yearly case totals, one line per crime type.
# `crime_over_years` and `results_dir` are assumed to be defined in earlier cells.
fig2 = px.line(
    crime_over_years,
    x='Year',
    y='Value',
    color='Crime Type',
    markers=True,
    title='Crime Trends Over Years',
    width=900,
    height=500,
    template='presentation',
    color_discrete_sequence=px.colors.sequential.Greens_r
)
fig2.update_traces(mode='lines+markers')
fig2.update_layout(
    xaxis_tickangle=0,
    paper_bgcolor="rgba(9,0,8,0)",
    plot_bgcolor="rgba(0,0,0,0)"
)
fig2.write_image(os.path.join(results_dir, 'crime_trends_over_years.jpg'))
fig2.write_image(os.path.join(results_dir, 'crime_trends_over_years.png'))
fig2.write_html(os.path.join(results_dir, 'crime_trends_over_years.html'))
fig2.show()

The chart helps identify whether certain crimes have increased, decreased, or remained stable over the years.
By comparing trends across crime types, we can see which issues have become more or less prominent over time.
This visualization is useful for policymakers, researchers, and social organizations to understand the evolution of gender-based violence and guide targeted interventions based on historical data.
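
The yearly totals behind such a line chart can be obtained by grouping on year and crime type. The sketch below assumes that is how `crime_over_years` was built and uses placeholder data; the wide pivot is an extra step added here to make the change between the first and last year easy to read.

```python
import pandas as pd

# Toy long-format data across two years and two states (illustrative values).
crime_df = pd.DataFrame({
    "Year": [2001, 2001, 2001, 2002, 2002, 2002],
    "State": ["s1", "s2", "s1", "s1", "s1", "s2"],
    "Crime Type": ["Type A", "Type A", "Type B", "Type A", "Type B", "Type B"],
    "Value": [5, 5, 8, 20, 6, 1],
})

# National total per year and crime type (summing over states).
crime_over_years = crime_df.groupby(["Year", "Crime Type"], as_index=False)["Value"].sum()

# Wide view: one row per year, one column per crime type.
wide = crime_over_years.pivot(index="Year", columns="Crime Type", values="Value")

# Change in cases between the first and last year for each crime type.
change = wide.iloc[-1] - wide.iloc[0]
print(wide)
print(change)
```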

# Line chart of yearly totals for the three crime types with the most cases.
# `crime_trend_by_year` and `results_dir` are assumed to be defined in earlier cells.
fig = px.line(
    crime_trend_by_year,
    x='Year',
    y='Value',
    color='Crime Type',
    markers=True,
    title='Trend of Top 3 Crime Types Over the Years',
    labels={'Value': 'Total Cases'},
    height=500,
    width=900,
    template='presentation',
    color_discrete_sequence=px.colors.sequential.Greens_r
)
fig.update_traces(mode='lines+markers')
fig.update_layout(
    xaxis_tickangle=0,
    paper_bgcolor="rgba(9,0,8,0)",
    plot_bgcolor="rgba(0,0,0,0)",
    legend_title_text='Crime Type'
)
fig.write_image(os.path.join(results_dir, 'top3_crime_trends_over_years.jpg'))
fig.write_image(os.path.join(results_dir, 'top3_crime_trends_over_years.png'))
fig.write_html(os.path.join(results_dir, 'top3_crime_trends_over_years.html'))
fig.show()

# Horizontal bar chart ranking crime types by total reported cases.
# `top_crimes` is assumed to be defined in an earlier cell.
fig = px.bar(
    top_crimes,
    x='Value',
    y='Crime Type',
    orientation='h',
    title='Top 10 Most Reported Crime Types',
    text='Value',
    color_discrete_sequence=['seagreen'],
    height=600,
    width=1000,
    template='presentation'
)
fig.update_traces(textposition='outside')
fig.update_layout(
    yaxis=dict(categoryorder='total ascending'),
    margin=dict(l=350, r=10, t=100, b=100)
)
fig.write_image(os.path.join(results_dir, 'Top10_most_reported_bar_plot.jpg'))
fig.write_image(os.path.join(results_dir, 'Top10_most_reported_bar_plot.png'))
fig.write_html(os.path.join(results_dir, 'Top10_most_reported_bar_plot.html'))
fig.show()

This horizontal bar chart ranks crime types by the number of recorded cases, showing up to ten (the dataset contains seven crime types).
The y-axis lists the crime types, ordered by total cases so that the largest bar appears at the top.
The x-axis represents the number of reported cases (`Value`).
Each bar is labeled with the exact count and drawn in a single green color.
This chart highlights which types of crimes are most prevalent in the dataset; the most reported is domestic violence.
It helps to quickly identify priority areas for policy intervention, awareness, and law enforcement efforts.
The use of a horizontal layout improves readability, especially for longer crime type names.
This visualization is useful for summarizing the most critical forms of violence against women and supporting targeted actions.
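
The inputs to the two charts above, `top_crimes` and `crime_trend_by_year`, are not shown in this notebook. The sketch below is one way they might be derived, using placeholder crime types and values; the exact steps are an assumption.

```python
import pandas as pd

# Toy data: four placeholder crime types over two years (illustrative values).
crime_df = pd.DataFrame({
    "Year": [2001, 2002] * 4,
    "Crime Type": ["Type A", "Type A", "Type B", "Type B",
                   "Type C", "Type C", "Type D", "Type D"],
    "Value": [1, 2, 10, 20, 100, 200, 5, 5],
})

# Total cases per crime type across all years.
totals = crime_df.groupby("Crime Type")["Value"].sum()

# Up to ten crime types with the most cases, as a two-column table for plotting.
top_crimes = totals.nlargest(10).reset_index()

# Yearly totals restricted to the three crime types with the most cases.
top3 = totals.nlargest(3).index
crime_trend_by_year = (
    crime_df[crime_df["Crime Type"].isin(top3)]
    .groupby(["Year", "Crime Type"], as_index=False)["Value"]
    .sum()
)
print(top_crimes)
print(crime_trend_by_year)
```

When the data has fewer than ten crime types, `nlargest(10)` simply returns all of them in descending order.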
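In this section, we apply two key statistical methods to the crime dataset:
- A chi-square test of independence between State and Crime Type, to examine whether there is a statistically significant association between the state where a crime occurred and the type of crime reported.
- A Spearman rank correlation between Year and Value for each crime type, to check whether reported cases show a monotonic trend over time.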
The results provide statistical backing for visual trends observed earlier and offer direction for targeted interventions or future modeling.

import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency, spearmanr

# 1. Chi-square test of independence between State and Crime Type
# `crime_df` is assumed to be the cleaned long-format DataFrame loaded earlier.
contingency = pd.pivot_table(crime_df, values='Value', index='State', columns='Crime Type', aggfunc='sum', fill_value=0)
chi2, p, dof, expected = chi2_contingency(contingency)
print("Chi-square test between State and Crime Type:")
print(f"Chi2 statistic = {chi2:.2f}, p-value = {p:.5f}")
if p < 0.05:
    print("=> Significant association between State and Crime Type (reject null hypothesis).")
else:
    print("=> No significant association between State and Crime Type (fail to reject null hypothesis).")

# 2. Correlation analysis between Year and crime Value by Crime Type
print("\nSpearman correlation between Year and crime Value for each Crime Type:")
correlation_results = []
for crime in crime_df['Crime Type'].unique():
    subset = crime_df[crime_df['Crime Type'] == crime]
    corr, pval = spearmanr(subset['Year'], subset['Value'])
    correlation_results.append({'Crime Type': crime, 'Spearman Correlation': corr, 'p-value': pval})
corr_df = pd.DataFrame(correlation_results)
print(corr_df)

# Interpret correlations
print("\nRecommendations based on statistical tests:")
if p < 0.05:
    print("- Crime distribution varies significantly by State. Target interventions at high-crime states.")
else:
    print("- Crime distribution does not vary significantly by State, suggesting uniform patterns.")
for _, row in corr_df.iterrows():
    if row['p-value'] < 0.05:
        trend = "increasing" if row['Spearman Correlation'] > 0 else "decreasing"
        print(f"- '{row['Crime Type']}' shows a statistically significant {trend} trend over years (correlation = {row['Spearman Correlation']:.2f}).")
    else:
        print(f"- '{row['Crime Type']}' shows no significant trend over years.")
print("\nAdditional notes:")
print("- Consider socio-economic factors for deeper correlations.")
print("- Use regression or time series analysis for forecasting trends.")

Chi-square test between State and Crime Type:
Chi2 statistic = 1039977.76, p-value = 0.00000
=> Significant association between State and Crime Type (reject null hypothesis).

Spearman correlation between Year and crime Value for each Crime Type:
                         Crime Type  Spearman Correlation       p-value
0                 No. of Rape cases              0.129544  4.261818e-04
1                Kidnap And Assault              0.200640  4.021422e-08
2                      Dowry Deaths             -0.017134  6.426011e-01
3             Assault against women              0.151784  3.553061e-05
4  Assault against modesty of women              0.080348  2.928747e-02
5                 Domestic violence              0.108515  3.201934e-03
6                 Women Trafficking              0.449122  8.076209e-38

Recommendations based on statistical tests:
- Crime distribution varies significantly by State. Target interventions at high-crime states.
- 'No. of Rape cases' shows a statistically significant increasing trend over years (correlation = 0.13).
- 'Kidnap And Assault' shows a statistically significant increasing trend over years (correlation = 0.20).
- 'Dowry Deaths' shows no significant trend over years.
- 'Assault against women' shows a statistically significant increasing trend over years (correlation = 0.15).
- 'Assault against modesty of women' shows a statistically significant increasing trend over years (correlation = 0.08).
- 'Domestic violence' shows a statistically significant increasing trend over years (correlation = 0.11).
- 'Women Trafficking' shows a statistically significant increasing trend over years (correlation = 0.45).

Additional notes:
- Consider socio-economic factors for deeper correlations.
- Use regression or time series analysis for forecasting trends.
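
To make the two statistics above more concrete, the sketch below computes them by hand on toy data. The numbers are invented for illustration. It uses only pandas and numpy: the chi-square statistic comes from expected counts under independence, and the Spearman coefficient is computed as the Pearson correlation of the ranks.

```python
import numpy as np
import pandas as pd

# Toy contingency table of case totals (rows: states, columns: crime types).
observed = pd.DataFrame(
    [[30, 10, 20], [20, 40, 30]],
    index=["state A", "state B"],
    columns=["Type X", "Type Y", "Type Z"],
)
obs = observed.to_numpy()

# Expected count for each cell if state and crime type were independent:
# (row total * column total) / grand total.
row_totals = obs.sum(axis=1)
col_totals = obs.sum(axis=0)
expected = np.outer(row_totals, col_totals) / obs.sum()

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells.
chi2 = ((obs - expected) ** 2 / expected).sum()
dof = (obs.shape[0] - 1) * (obs.shape[1] - 1)
print(f"chi2 = {chi2:.4f}, dof = {dof}")

# Toy yearly series for one crime type.
trend = pd.DataFrame({
    "Year": [2001, 2002, 2003, 2004, 2005],
    "Value": [5, 9, 7, 12, 30],
})

# Spearman correlation equals the Pearson correlation of the ranked values.
rho = trend["Year"].rank().corr(trend["Value"].rank())
print(f"Spearman rho = {rho:.2f}")
```

The coefficient ranges from -1 to 1, so values such as 0.08 to 0.20 in the output above indicate a weak monotonic relationship even when the p-value is small.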